http://www3.dsi.uminho.pt/pcortez/wine/
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
%config Completer.use_jedi = False
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.metrics import classification_report, confusion_matrix
dfr = pd.read_csv('./data/winequality/winequality-red.csv', sep=';')
dfw = pd.read_csv('./data/winequality/winequality-white.csv', sep=';')
dfw['type'] = 'white'
dfr['type'] = 'red'
df = pd.concat([dfw, dfr])
display(df.head())
display(df.tail())
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 | white |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 | white |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 | white |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 | white |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 | white |
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 | red |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 | red |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 | red |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 | red |
| 1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 | red |
df.shape
(6497, 13)
wine type - 1599 red and 4898 white wines
fixed acidity - Most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
volatile acidity - The amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric acid - Found in small quantities, citric acid can add 'freshness' and flavor to wines
residual sugar - The amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
chlorides - The amount of salt in the wine
free sulfur dioxide - The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide - Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density - The density of wine is close to that of water, depending on the percent alcohol and sugar content
pH - Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates - A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol - The percent alcohol content of the wine
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6497 entries, 0 to 1598
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64
 12  type                  6497 non-null   object
dtypes: float64(11), int64(1), object(1)
memory usage: 710.6+ KB
We combined two distinct but very similar data sets, one covering red wines and one covering white wines. They are similar because their columns are exactly the same.
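One side effect of `pd.concat` worth noting: by default it keeps each frame's original index, which is why `df.info()` reports an `Int64Index` running from 0 to 1598 with duplicate labels. A minimal sketch with toy frames (hypothetical values):

```python
import pandas as pd

# Two tiny frames standing in for the white and red data sets (hypothetical values)
dfw = pd.DataFrame({'quality': [6, 5], 'type': 'white'})
dfr = pd.DataFrame({'quality': [5, 7], 'type': 'red'})

# Plain concat keeps each frame's original index, so labels repeat (0, 1, 0, 1)
dup = pd.concat([dfw, dfr])
print(dup.index.tolist())          # [0, 1, 0, 1]

# ignore_index=True rebuilds a clean 0..n-1 index for the combined frame
clean = pd.concat([dfw, dfr], ignore_index=True)
print(clean.index.tolist())        # [0, 1, 2, 3]
```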
fig, ax = plt.subplots(figsize=(15, 8))
ax = sb.heatmap(df.corr(numeric_only=True), annot=True)  # numeric_only needed in pandas >= 2.0 because 'type' is a string column
fig = make_subplots(rows=1,
cols=2,
shared_yaxes=True,
horizontal_spacing=0.01,
)
fig.update_layout( height=1200,
showlegend=False,
title='Quality distribution per type'
)
fig.add_trace(go.Box(x=dfw['type'], y=dfw['quality']), row=1, col=1)
fig.add_trace(go.Box(x=dfr['type'], y=dfr['quality']), row=1, col=2)
Despite combining two different data sets, there are no particular differences between the quality distributions of the two wine types.
px.histogram(data_frame=df, x='quality', height=600, width=1200, title='<b>Quality distribution</b>')
The intent of this work is to predict the quality of a wine from its organoleptic characteristics.
Assuming there can be a direct correlation between a single characteristic and wine quality, can we already identify which characteristic might help us predict the value of quality?
These correlation indices cannot capture relationships between quality and combinations of multiple characteristics; that will have to be learned by the model we build.
If we look at the heat map below, we can identify which features have a higher correlation index. But they may not necessarily be the ones relevant to the prediction.
fig, ax = plt.subplots(figsize=(15, 8))
ax = sb.heatmap(df.corr(numeric_only=True), annot=True)  # numeric_only needed in pandas >= 2.0 because 'type' is a string column
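The single-feature correlations with quality can also be ranked numerically instead of being read off the heat map. A sketch on a toy frame (hypothetical values); on the real data the same one-liner works on `df`:

```python
import pandas as pd

# Toy frame (hypothetical values); with the real data, replace `toy` with df
toy = pd.DataFrame({
    'alcohol': [9.0, 10.5, 11.0, 12.5],
    'density': [0.999, 0.996, 0.995, 0.992],
    'quality': [5, 6, 6, 7],
})

# Correlation of every numeric feature with quality, strongest first
corr = toy.corr(numeric_only=True)['quality'].drop('quality')
print(corr.abs().sort_values(ascending=False))
```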
dummies = pd.get_dummies(df['type'])
y = df['quality']
X_ = df.drop(['type', 'quality'], axis=1)
X = pd.concat([X_, dummies[['red']]], axis = 1)
X
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | red | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.270 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.00 | 0.45 | 8.8 | 0 |
| 1 | 6.3 | 0.300 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.30 | 0.49 | 9.5 | 0 |
| 2 | 8.1 | 0.280 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.26 | 0.44 | 10.1 | 0 |
| 3 | 7.2 | 0.230 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 0 |
| 4 | 7.2 | 0.230 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 1 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 1 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 1 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 1 |
| 1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 1 |
6497 rows × 12 columns
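Keeping only the `red` column of the dummies avoids carrying two perfectly collinear indicator columns. `pd.get_dummies(..., drop_first=True)` achieves the same in one call; note that it drops the alphabetically first category, so here it would keep a white indicator instead. A toy sketch (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({'type': ['white', 'red', 'white']})

# drop_first=True keeps k-1 columns for k categories; categories are ordered
# alphabetically, so 'red' is dropped and only 'type_white' remains
dummies = pd.get_dummies(toy['type'], prefix='type', drop_first=True)
print(dummies.columns.tolist())    # ['type_white']
```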
alphas = 10**np.linspace(2,-2,100)*0.5
alphas
array([5.00000000e+01, 4.55581378e+01, 4.15108784e+01, 3.78231664e+01,
3.44630605e+01, 3.14014572e+01, 2.86118383e+01, 2.60700414e+01,
2.37540508e+01, 2.16438064e+01, 1.97210303e+01, 1.79690683e+01,
1.63727458e+01, 1.49182362e+01, 1.35929412e+01, 1.23853818e+01,
1.12850986e+01, 1.02825615e+01, 9.36908711e+00, 8.53676324e+00,
7.77838072e+00, 7.08737081e+00, 6.45774833e+00, 5.88405976e+00,
5.36133611e+00, 4.88504979e+00, 4.45107543e+00, 4.05565415e+00,
3.69536102e+00, 3.36707533e+00, 3.06795364e+00, 2.79540509e+00,
2.54706901e+00, 2.32079442e+00, 2.11462144e+00, 1.92676430e+00,
1.75559587e+00, 1.59963357e+00, 1.45752653e+00, 1.32804389e+00,
1.21006413e+00, 1.10256537e+00, 1.00461650e+00, 9.15369140e-01,
8.34050269e-01, 7.59955541e-01, 6.92443186e-01, 6.30928442e-01,
5.74878498e-01, 5.23807876e-01, 4.77274228e-01, 4.34874501e-01,
3.96241449e-01, 3.61040451e-01, 3.28966612e-01, 2.99742125e-01,
2.73113861e-01, 2.48851178e-01, 2.26743925e-01, 2.06600620e-01,
1.88246790e-01, 1.71523464e-01, 1.56285792e-01, 1.42401793e-01,
1.29751211e-01, 1.18224471e-01, 1.07721735e-01, 9.81520325e-02,
8.94324765e-02, 8.14875417e-02, 7.42484131e-02, 6.76523887e-02,
6.16423370e-02, 5.61662016e-02, 5.11765511e-02, 4.66301673e-02,
4.24876718e-02, 3.87131841e-02, 3.52740116e-02, 3.21403656e-02,
2.92851041e-02, 2.66834962e-02, 2.43130079e-02, 2.21531073e-02,
2.01850863e-02, 1.83918989e-02, 1.67580133e-02, 1.52692775e-02,
1.39127970e-02, 1.26768225e-02, 1.15506485e-02, 1.05245207e-02,
9.58955131e-03, 8.73764200e-03, 7.96141397e-03, 7.25414389e-03,
6.60970574e-03, 6.02251770e-03, 5.48749383e-03, 5.00000000e-03])
ridge = Ridge(normalize=True)  # NB: `normalize` was removed in scikit-learn 1.2; with newer versions, standardize X (e.g. with StandardScaler) and drop this argument
coefs = []
for a in alphas:
ridge.set_params(alpha=a)
ridge.fit(X, y)
coefs.append(ridge.coef_)
We varied the regularization parameter to find the best value of alpha
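The shrinkage this sweep visualizes can also be reproduced from first principles: ridge regression has the closed form b = (X'X + alpha*I)^-1 X'y, so every coefficient shrinks toward zero as alpha grows. A minimal sketch on synthetic data (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy standardized features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge_coefs(X, y, alpha):
    # Closed-form ridge solution: (X'X + alpha*I)^-1 X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

small = ridge_coefs(X, y, alpha=0.01)    # near the OLS solution
large = ridge_coefs(X, y, alpha=1e4)     # heavily shrunk toward zero

# Stronger regularization shrinks every coefficient toward zero
print(np.abs(small))
print(np.abs(large))
```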
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')
X_train, X_test , y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error', normalize = True) # RidgeCV does everything Ridge() does, plus cross-validation over the alpha grid
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
0.005487493827465278
The best alpha for the ridge regressor is about 0.0055. What stands out from the coefficient paths above is the enormous preponderance of one feature over the others.
ridge2 = Ridge(alpha=ridgecv.alpha_, normalize=True)
ridge2.fit(X_train, y_train)
pred2 = ridge2.predict(X_test)
print(pd.Series(ridge2.coef_, index = X.columns)) # print the coefficients
print('MSE: ', mean_squared_error(y_test, pred2)) # compute the MSE
fixed acidity            0.078528
volatile acidity        -1.516770
citric acid             -0.128580
residual sugar           0.057925
chlorides               -0.741894
free sulfur dioxide      0.004271
total sulfur dioxide    -0.001070
density                -93.193896
pH                       0.439425
sulphates                0.675508
alcohol                  0.230577
red                      0.369690
dtype: float64
MSE:  0.5328691320131744
Our conjecture was wrong: the feature that dominates is density, by an overwhelming margin. The MSE is about 0.53, a relatively low error. Let us now see what happens with the Lasso.
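One caveat before reading too much into the size of the density coefficient: coefficient magnitude depends on feature scale. Density varies by only a few thousandths of a unit, so its coefficient must be large to contribute at all, and shrinking a feature's scale inflates its coefficient by the same factor. A sketch with a hand-rolled least-squares slope (toy data, hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)

def ols_slope(x, y):
    # Least-squares slope for a single centered feature
    x = x - x.mean()
    return (x @ (y - y.mean())) / (x @ x)

b_raw = ols_slope(x, y)            # ~3
b_shrunk = ols_slope(x / 100, y)   # identical fit, coefficient 100x larger

print(b_raw, b_shrunk)
```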
alphas = 10**np.linspace(0,-10,100)*0.5
lasso = Lasso(max_iter = 10000, normalize = True) # as with Ridge, `normalize` was removed in scikit-learn 1.2; standardize the features instead
coefs = []
for a in alphas:
lasso.set_params(alpha=a)
lasso.fit(X_train, y_train)
coefs.append(lasso.coef_)
ax = plt.gca()
ax.plot(alphas, coefs)  # coefficient paths as a function of alpha
ax.set_xscale('log')
lasso = Lasso(max_iter = 10000, normalize = True)
lassocv = LassoCV(alphas = alphas, cv = 10, max_iter = 100000, normalize = True)
lassocv.fit(X_train, y_train)
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_train, y_train)
mean_squared_error(y_test, lasso.predict(X_test))
lassocv.alpha_
1.1285098598169608e-05
print(pd.Series(lasso.coef_, index=X.columns))
print('MSE: ', mean_squared_error(y_test, lasso.predict(X_test))) # compute the MSE on the lasso predictions (not pred2, which came from the ridge model)
fixed acidity            0.081709
volatile acidity        -1.523864
citric acid             -0.123871
residual sugar           0.060114
chlorides               -0.686217
free sulfur dioxide      0.004113
total sulfur dioxide    -0.000996
density                -98.253465
pH                       0.454895
sulphates                0.675642
alcohol                  0.226745
red                      0.388612
dtype: float64
MSE:  0.5328691320131744
Density is the most relevant feature for our Lasso model as well. The MSE is again about 0.53, comparable to the ridge model.
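Unlike ridge, lasso can drive coefficients exactly to zero. Here the cross-validated alpha is so small (~1e-5) that nothing is zeroed, but with a larger penalty lasso performs feature selection. A sketch on synthetic data (all values hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

weak = Lasso(alpha=0.001).fit(X, y)    # tiny penalty: essentially OLS
strong = Lasso(alpha=0.5).fit(X, y)    # strong penalty: sparse solution

# The strong penalty zeroes the three irrelevant coefficients entirely
print(np.flatnonzero(strong.coef_))    # [0 1]
print(strong.coef_)
```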
We can also try a classification task: predicting whether a wine is white or red.
df.index.values
array([ 0, 1, 2, ..., 1596, 1597, 1598], dtype=int64)
from sklearn.model_selection import train_test_split
X = df.drop(['type'], axis=1)
y = df['type']
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score # NB: classifiers and regressors each have their own specific metrics
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predicted_train = model.predict(X_train) # check how close the predictions are to the desired labels
predicted_test = model.predict(X_test)
print('Train accuracy')
print(accuracy_score(y_train, predicted_train)) # training-set accuracy
print('Test score')
print(accuracy_score(y_test, predicted_test)) # test-set accuracy
Train accuracy
1.0
Test score
0.9864615384615385
conf_mat = confusion_matrix(y_train, predicted_train)
print(conf_mat)
print(classification_report(y_test, predicted_test))
[[1215 0]
[ 0 3657]]
precision recall f1-score support
red 0.97 0.97 0.97 384
white 0.99 0.99 0.99 1241
accuracy 0.99 1625
macro avg 0.98 0.98 0.98 1625
weighted avg 0.99 0.99 0.99 1625
The classification report indicates near-perfect precision and recall for both classes (0.97 for red, 0.99 for white).
# Data handling
import pandas as pd
# Exploratory Data Analysis & Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
# Model improvement and Evaluation
from sklearn import metrics
from sklearn.metrics import confusion_matrix
# Plotting confusion matrix
matrix = pd.DataFrame(conf_mat,
                      index=('red', 'white'),    # confusion_matrix orders labels alphabetically
                      columns=('red', 'white'))
print(matrix)
# Visualising confusion matrix
plt.figure(figsize = (16,14),facecolor='white')
heatmap = sns.heatmap(matrix, annot = True, annot_kws = {'size': 20}, fmt = 'd', cmap = 'YlGnBu')
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation = 0, ha = 'right', fontsize = 18, weight='bold')
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation = 0, ha = 'right', fontsize = 18, weight='bold')
plt.title('Decision Tree\n', fontsize = 18, color = 'darkblue')
plt.ylabel('True type', fontsize = 14)
plt.xlabel('Predicted type', fontsize = 14)
plt.show()
       red  white
red   1215      0
white    0   3657
The prediction made with the decision tree is particularly good: most of the results lie on the diagonal. Classifying the type of wine is easy for the model, perhaps because red and white wines have markedly different physicochemical characteristics.
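A training accuracy of 1.0 against a test accuracy of ~0.986 suggests the unconstrained tree memorizes the training set. Limiting `max_depth` is a common way to rein this in; a minimal sketch on synthetic data (all values hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
# Label depends on one feature, plus label noise the tree should not memorize
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.15
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)

# The unconstrained tree fits the noisy labels perfectly; the shallow one cannot
print(deep.score(X_tr, y_tr))      # 1.0
print(shallow.score(X_tr, y_tr))   # < 1.0
print(deep.score(X_te, y_te), shallow.score(X_te, y_te))
```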
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred)
print(conf_mat)
print(classification_report(y_test, y_pred))
[[ 373 11]
[ 3 1238]]
precision recall f1-score support
red 0.99 0.97 0.98 384
white 0.99 1.00 0.99 1241
accuracy 0.99 1625
macro avg 0.99 0.98 0.99 1625
weighted avg 0.99 0.99 0.99 1625
The data collected from the classification report indicate that, in general, the classifier works well, with accuracy very close to that of the decision tree.
# Plotting confusion matrix
matrix = pd.DataFrame(conf_mat,
                      index=('red', 'white'),    # confusion_matrix orders labels alphabetically
                      columns=('red', 'white'))
print(matrix)
# Visualising confusion matrix
plt.figure(figsize = (16,14),facecolor='white')
heatmap = sns.heatmap(matrix, annot = True, annot_kws = {'size': 20}, fmt = 'd', cmap = 'YlGnBu')
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation = 0, ha = 'right', fontsize = 18, weight='bold')
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation = 0, ha = 'right', fontsize = 18, weight='bold')
plt.title('K-Neighbors\n', fontsize = 18, color = 'darkblue')
plt.ylabel('True type', fontsize = 14)
plt.xlabel('Predicted type', fontsize = 14)
plt.show()
       red  white
red    373     11
white    3   1238
For K-Neighbors the classification also works well: the values concentrate on the TP-TN diagonal, with few FP-FN errors. At first glance it seems to perform slightly worse than the previous classifier (the decision tree), but comparing precision, recall and f1-score shows the two are roughly equivalent.
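The StandardScaler step above matters for K-Neighbors because Euclidean distance is dominated by whichever feature has the largest numeric range (e.g. total sulfur dioxide, in the tens to hundreds, versus density, near 1). A sketch on synthetic data (all values hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 300
# Informative feature on a tiny scale, noise feature on a huge scale
informative = np.where(rng.random(n) < 0.5, 0.0, 0.01) + rng.normal(scale=0.001, size=n)
noise = rng.normal(scale=100.0, size=n)
X = np.column_stack([informative, noise])
y = (informative > 0.005).astype(int)

raw = KNeighborsClassifier(n_neighbors=5).fit(X[:200], y[:200])
Xs = StandardScaler().fit(X[:200]).transform(X)
scaled = KNeighborsClassifier(n_neighbors=5).fit(Xs[:200], y[:200])

# Without scaling, distances are all noise; with scaling, KNN recovers the signal
print(raw.score(X[200:], y[200:]))
print(scaled.score(Xs[200:], y[200:]))
```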
We considered two very similar data sets, one of white wines and one of red wines, combined them and created a "type" column that we then used for classification.
Looking at the correlation indices, we initially noticed a high correlation between wine quality and alcohol content, only to find that it was density that carried the most weight in predicting quality.
Through linear regression, our models predict wine quality with an MSE of ~0.53 for both the Ridge and Lasso regressors. Finally, we compared two classifiers (Decision Tree and K-Neighbors) to determine the type of wine (white or red), both reaching roughly 99% test accuracy.